Before installation, please make sure:
using R (>= 3.5.0)
installed certain dependent R packages: devtools, SuperLearner(>= 2.0-28)
installed other suggested R packages depend on what ML model would be used, eg. ranger, gbm, xgboost, gam et al.
The developing version of deeper can be found from github.
Using the following syntax to install:
library(devtools)
install_github("Alven8816/deeper")
Install deeper R package.
Outstanding model performance. DEML is an extension of a SuperLearner(SL) ensemble algorithm (Naimi and Balzer 2018; Polley and Van Der Laan 2010; Van der Laan et al. 2007) by introducing the neural network hierarchy structure. Generally, it achieved higher estimation accuracy than SL and other single ML models.
Assessing constructed individual models simultaneously. DEML can evaluate the performance of all constructed individual models simultaneously and generate the optimal weights for each model.
Customizing the model structure and algorithms. You can modify the structure of DEML and achieve the best one for your tasks.
Minimizing errors from empirical experience. DEML not only evaluates the performance of all constructed models simultaneously but also automatically selects an optimal integration, as well as the selection of hyper-parameters.
Easy to use and extent. You do not need to have a high-steep learning curve to learn how it works.
The details about the DEML can be found here.
The DEML framework proposed above is a three-level stacked ensemble approach. It is based on the SuperLearner(SL) ensemble algorithm (Naimi and Balzer 2018; Polley and Van Der Laan 2010; Van der Laan et al. 2007) introduced in the neural network hierarchy structure.
To estimate the daily ambient PM2.5 in the northeast of China in 2015-2016
# load library
library(ggplot2)
library(caret)
library(skimr)
library(CAST)
library(SuperLearner)
library(deeper)
data("envir_example")
| date | code | year | month | week | PM2.5 | c_AOD | TEMP |
|---|---|---|---|---|---|---|---|
| 2/01/2015 | 1001A | 2015 | 1 | 5 | 51.26 | 0.22 | -2.76 |
| 3/01/2015 | 1001A | 2015 | 1 | 6 | 154.04 | 0.92 | -3.76 |
| 4/01/2015 | 1001A | 2015 | 1 | 0 | 151.92 | 0.39 | -1.30 |
| 6/01/2015 | 1001A | 2015 | 1 | 2 | 39.55 | 0.30 | -2.22 |
| 8/01/2015 | 1001A | 2015 | 1 | 4 | 148.38 | 1.05 | -3.52 |
| 9/01/2015 | 1001A | 2015 | 1 | 5 | 87.78 | 0.18 | -1.59 |
The basic data clean strategies include:
Variable type setting
Extreme value (outliers) detection
Missing value operation (imputation, drop)
Data transforming (normalization/standardization, eg. scale, centralize,log-transform and others)
# skim the data missing value and distribution
skimr::skim(envir_example)
| skim_type | skim_variable | n_missing | complete_rate | numeric.mean | numeric.sd |
|---|---|---|---|---|---|
| numeric | RH | 0 | 1 | 50.6 | 17.3 |
| numeric | elevation | 0 | 1 | 90.4 | 108.0 |
| numeric | WS | 0 | 1 | 2.0 | 0.8 |
| numeric | a_buffer10 | 0 | 1 | 103.3 | 29.1 |
We randomly select 20% of the data as independent testing dataset, and the remainder were used as the training dataset.
The split strategy is based on the size of your sample as well as your question.
set.seed(1234) # to achieve a repeatable results
size <-
caret::createDataPartition(y = envir_example$PM2.5,
p = 0.8, list = FALSE)
trainset <- envir_example[size, ]
testset <- envir_example[-size, ]
Identify the dependence and independence variables
y <- c("PM2.5")
x <- colnames(envir_example[-c(1, 6)]) # except "date" and "PM2.5"
Q: Estimate annual average NO2 in Sydney in 2005-2018
Download the Sydney NO2 data in the CloudStor
Tasks:
Setting the dependence (“no2_annual”) and others as independent variables
Split 10% of data as testing data set
#try it here
Q: How to select the best parameter for a single base model?
We can set or adjust the parameters of a base model using ‘tuningModel’ function.
ranger <-
tuningModel(
basemodel = 'SL.ranger',
params = list(num.trees = 100),
tune = list(mtry = c(1, 3, 7))
)
Here we will train a Random Forest (RF) model with the specific parameters and using 5-fold Cross validation (CV) to assess the model performance.
# training the RF model with different parameters simultaneously
start_time <- Sys.time()
model1 <-
predictModel(
Y = trainset[, y],
X = trainset[, x],
base_model = c(ranger),
cvControl = list(V = 5)
)
##
##
## The base models Cross validation result:
##
## SL.ranger_1_All SL.ranger_2_All SL.ranger_3_All
## weight 0.0000000 0.3004660 0.6995340
## R2 0.7978256 0.8749765 0.8752692
## RMSE 35.2163342 23.8228438 23.2379626
end_time <- Sys.time()
end_time - start_time
## Time difference of 4.001203 secs
#print(model1$base_ensemble_value)
The training results show that ‘mtry = 7’ could achieve a better RF model performance.
After training a base ML model, we can use it to estimate the independent testing dataset by using ‘predict’ function.
Note: ’predict()’function was recommended to limit its sources (namespace) to reduce the conflict with other R package such as stats::predict()
# compare the model performance in the independent testing dataset
pred_model1 <- deeper::predict(object = model1, newX = testset[, x])
# get each base model prediction
head(pred_model1$pre_base$library.predict)
## SL.ranger_1_All SL.ranger_2_All SL.ranger_3_All
## [1,] 52.90729 47.68343 48.09790
## [2,] 61.50432 66.76404 67.36444
## [3,] 68.93616 77.79895 62.86596
## [4,] 66.66083 73.91816 77.26067
## [5,] 147.77336 177.64796 174.72512
## [6,] 56.60838 51.01905 49.36100
# calculate model performance in testing dataset
print(apply(
X = pred_model1$pre_base$library.predict,
MARGIN = 2,
FUN = caret::R2,
obs = testset[, y]
))
## SL.ranger_1_All SL.ranger_2_All SL.ranger_3_All
## 0.7843890 0.8562903 0.8780889
After examine the model performance in an independent testing dataset, we may finally select the RF model with mtry = 7 as the best parameters.
Tasks:
Establish a Random Forest model with parameter: mtry = 7 and leave others as default
Establish another base model ‘xgboost’ simultaneously by setting base_model = “SL.xgboost”
#try it here
Considering the time-consuming of running several base models simultaneously, we can select using parallel computing to help improve the computational efficiency.
We can identify the index in the cross validation(CV) to conduct the spatial (cluster) or temporal CV
# there are 7 stations in the trainset
unique(trainset$code)
## [1] 1001A 1002A 1003A 1004A 1005A 1006A 1007A
## Levels: 1001A 1002A 1003A 1004A 1005A 1006A 1007A
# Create a list with 7 (folds) elements (each element contains index of rows to be considered on each fold)
## conduct the spatial CV
indices <-
CAST::CreateSpacetimeFolds(trainset, spacevar = "code", k = 7)
# Rows of validation set on each fold
v_raw <- indices$indexOut
names(v_raw) <- seq(1:7)
start_time <- Sys.time()
model2 <- predictModel_parallel(
Y = trainset[, y],
X = trainset[, x],
base_model = c("SL.xgboost", "SL.ranger"),
cvControl = list(V = length(v_raw), validRows = v_raw),
number_cores = 4,
seed = 1234
)
##
##
## The base models Cross validation result:
##
## SL.xgboost_All SL.ranger_All SL.predict
## weight 0.3776465 0.6223535 1.000000
## R2 0.8596721 0.8769333 0.988768
## RMSE 24.3083180 23.5561368 7.618139
end_time <- Sys.time()
end_time - start_time
## Time difference of 11.51532 secs
## when number_cores is missing, it will indicate user to set one based on the operation system.
# pred_m3 <- predictModel_parallel(
# Y = trainset[,y],
# X = trainset[,x],
# base_model = c("SL.xgboost",ranger),
# cvControl = list(V = length(v_raw), validRows = v_raw),
# seed = 1
# )
#You have 8 cpu cores, How many cpu core you want to use:
# type the number to continue the process.
# prediction
pred_model2 <- deeper::predict(object = model2, newX = testset[, x])
After assessing the performances of base models, we now can move forward to the DEML by stacking meta models on it.
#Do include original feature
#object do not include newX dataset
model3_stack <-
stack_ensemble(
object = model1,
meta_model = c("SL.ranger", "SL.xgboost", "SL.glm"),
original_feature = FALSE,
X = trainset[, x]
)
##
## The meta models cross validation results:
##
## SL.ranger_All SL.xgboost_All SL.glm_All
## weight 0.1322739 0.0000000 0.8677261
## R2 0.8707404 0.8490553 0.8815373
## RMSE 23.3403989 25.3698735 22.3156117
## the training results
#model3_stack$stack_ensemble_value
# achieving independent testing results
model3_DEML <-
deeper::predict(object = model3_stack, newX = testset[, x])
print(apply(
X = cbind(model3_DEML$pre_meta$library.predict,
model3_DEML$pre_meta$pred),
MARGIN = 2,
FUN = caret::R2,
obs = testset[, y]
))
## SL.ranger_All SL.xgboost_All SL.glm_All
## 0.8648366 0.8537735 0.8738125 0.8738318
print(apply(
X = cbind(model3_DEML$pre_meta$library.predict,
model3_DEML$pre_meta$pred),
MARGIN = 2,
FUN = caret::RMSE,
obs = testset[, y]
))
## SL.ranger_All SL.xgboost_All SL.glm_All
## 24.53393 25.59143 23.70075 23.69987
We can create DEML directly by setting the base models and meta models. But considering the unknown impact of the underlying model and computation time, this is not recommended.
model4_stack <-
stack_ensemble.fit(
Y = trainset[, y],
X = trainset[, x],
base_model = c("SL.xgboost", ranger),
meta_model = c("SL.ranger", "SL.xgboost", "SL.glm"),
original_feature = FALSE
)
##
##
## The base models Cross validation result:
##
## SL.xgboost_All SL.ranger_1_All SL.ranger_2_All SL.ranger_3_All
## weight 0.5093405 0.0000000 0.1946900 0.2959695
## R2 0.8843921 0.8020241 0.8840721 0.8842767
## RMSE 22.0529810 34.4452850 22.8514683 22.3367349
##
## The meta models cross validation results:
##
## SL.ranger_All SL.xgboost_All SL.glm_All
## weight 0.3955923 0.09226124 0.5121465
## R2 0.8976871 0.88606742 0.8980186
## RMSE 20.7446071 21.90717443 20.7052018
We also accelerate the calculation with paralleling computing in DEML.
Several key points are worthy to note:
If the base model used parallel computing, the meta-model also needs to be parallel.
When setting a specific CV index, please be consistent in meta-model training.
Do not use all of your computation cores to do paralleling. Leave at least one for your operating system.
‘Original_feature’ is optional. It may generally improve your model performance but increase the computational complexity.
#Do not include original feature
model5_stack <-
stack_ensemble_parallel(
object = model2,
Y = trainset[, y],
meta_model = c("SL.ranger", "SL.xgboost", "SL.glm"),
original_feature = FALSE,
cvControl = list(V = length(v_raw), validRows = v_raw),
number_cores = 4
)
##
## The stack ensemble cross validation value:
##
## SL.ranger_All SL.xgboost_All SL.glm_All deeper
## weight 0.1747935 0.0000000 0.8252065 1.0000000
## R2 0.8615084 0.8463332 0.8729455 0.9019219
## RMSE 24.1577567 25.4904753 23.1140687 20.3172672
# the training results
# the testing results
pred_model5_stack <-
deeper::predict(object = model5_stack, newX = testset[, x])
Tasks:
Stacking meta models using RF and Xgboost (with original features) in Sydeny data to conduct DEML model
Achieving the final DEML model performance in testing dataset
#try it here
Download the Sydney NO2 data in the CloudStor
Task:
To estimate 10km grid cell yearly NO2 in Sydney.
#try it here
We can finally have the scatter plot using ‘assess.plot’ function in deeper.
plot_DEML <-
assess.plot(pre = model3_DEML$pre_meta$pred, obs = testset[, y])
print(plot_DEML$plot)
Wenhua Yu, Shanshan Li, Tingting Ye,Rongbin Xu, Jiangning Song, Yuming Guo (2022) Deep ensemble machine learning framework for the estimation of PM2.5 concentrations,Environmental health perspectives: https://doi.org/10.1289/EHP9752